ABSTRACT

In many real-time systems, the need to display signal amplitude as a function of
frequency has led to a requirement for spectrum analysis. This work is concerned with
the design and implementation of FFT calculations based on field-programmable gate
array (FPGA) technology.

The FPGA offers a new option for implementing the Rader-Brenner FFT
algorithm. An FPGA retains the high specificity of an ASIC while avoiding its high
development cost and its inability to accommodate design modifications after
production. Being highly adaptable and design-flexible, FPGAs make efficient use of
the device by conserving board space and system power, important advantages not
available with many stand-alone DSP chips. In this work, the FPGA design is
developed and simulated with the Xilinx Foundation Series (F3.1i) schematic editor
and its simulator. The device selected for simulation is the XC-4085 from Xilinx Inc.

A real-time FFT method is used to implement the Rader-Brenner
algorithm. The resulting design operates successfully at clock frequencies up to
15 MHz for real-time applications. The results show that computing the FFT with
the Rader-Brenner algorithm gives the best results when complexity and speed are
both considered, and that it also requires fewer gates than the radix-2 and
radix-4 FFT implementations.
1. Introduction:

This work aims to improve real-time embedded system development (here, an FFT
processor integrated on an FPGA) by using modeling and simulation as a fundamental
cornerstone in all development stages of system design.

1. The engineering of real-time and embedded products presents significant
challenges. Real-time performance requires timely response, often using
priority-driven control flow. The embedded nature of these systems often
results in application-specific hardware components and may entail the
co-design of both hardware and software. The deployment and support of the
systems must often account for unique aspects of the larger structure in
which the system is embedded. Testing under actual operating conditions
may be impractical, and in some cases impossible (e.g., a satellite control
system). Models and simulation are therefore often used in the early development
stages of these products [1: Trevor W. Pearce "Simulation-driven 
Architecture in the Engineering of Real-Time Embedded Systems", http// 
www.pearce@sce. Carleton.ca, 2003.]. 
2. A system that is required to obtain data, emit data, or interact with its
environment at precise times is said to be a real-time system. All computer
systems are in some sense real-time systems. For controlling processes or
mechanisms, a real-time system is one whose temporal performance
(response time, data-acquisition period, etc.) is of critical importance to the
industrial system to which it is connected [2: Peter D. L., "Real-Time
Microcomputer System Design", McGraw-Hill, Inc., 1988.].
2. Fast Fourier Transform: 

In 1965 Cooley and Tukey introduced an efficient way, later named the FFT,
to implement the discrete Fourier transform (DFT). The most popular FFT algorithm is
the radix-2 FFT, which requires the transform size to be a power of two. The FFT is
not a single algorithm but rather a large family of algorithms that increase the
computational efficiency of the DFT.

The FFT operates by decomposing an N-point time-domain signal into N
time-domain signals, each composed of a single point. The second step is to calculate
the N frequency spectra corresponding to these N time-domain signals. Lastly, the N
spectra are synthesized into a single frequency spectrum. There are two basic
classes of FFT algorithm: decimation in time (DIT) and decimation in frequency
(DIF).

The DFT pair is given by Equations (1) and (2):

$$x[n] = \sum_{k=0}^{N-1} c_k\, e^{j2\pi nk/N} \tag{1}$$

and

$$c_k = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\, e^{-j2\pi nk/N} \tag{2}$$

For both software and hardware implementations of Equations (1) and (2), the
computational efficiency is usually expressed by the number of complex
multiplications and additions required or, simply, by the number of
operations. A straightforward implementation of either Equation (1) or (2) requires
$N^2$ operations, whereas FFT algorithms typically reduce this number to
$N\log_2 N$. For N = 1024, the FFT algorithm is therefore about 100 times faster
than the direct implementation of Equation (1) or (2).
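As a quick check of the speed-up figure, the Python sketch below (Python is used for all examples here; the function names are illustrative, not part of the original design) compares the $N^2$ operation count of the direct formula with the $N\log_2 N$ count quoted for the FFT.

```python
def direct_ops(n_points: int) -> int:
    """Operations for a direct evaluation of Eq. (1) or (2): N^2."""
    return n_points * n_points


def fft_ops(n_points: int) -> int:
    """Typical FFT operation count, N*log2(N); N must be a power of two."""
    stages = n_points.bit_length() - 1          # log2(N) for a power of two
    if n_points < 2 or (1 << stages) != n_points:
        raise ValueError("N must be a power of two (N >= 2)")
    return n_points * stages


for n in (8, 64, 1024):
    print(f"N={n:5d}  direct={direct_ops(n):8d}  fft={fft_ops(n):6d}  "
          f"ratio={direct_ops(n) / fft_ops(n):6.1f}")
```

For N = 1024 the ratio is 1024/10 = 102.4, which is where the "about 100 times" figure comes from; for the 8-point case used later it is only about 2.7.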

The essence of all FFT algorithms is the periodicity and symmetry of the
exponential term in Equations (1) and (2), and the possibility of breaking down a
transform into a sum of smaller transforms for subsets of the data. Since n and k are
both integers, the exponential term is periodic with period N, and for even N it
changes sign after half a period: $e^{-j2\pi(k+N/2)/N} = -e^{-j2\pi k/N}$.

[3: J.G. Proakis and D.G. Manolakis, "Digital Signal Processing: Principles,
Algorithms, and Applications", 2nd ed., New York: Macmillan, 1992]
[4: J.S. Bendat and A.G. Piersol, "Random Data: Analysis and Measurement
Procedures", 2nd ed., New York: John Wiley & Sons, 1986]
[5: J.W. Cooley and J.W. Tukey, "An algorithm for the machine computation of
complex Fourier series", Math. Computation, 19: 297-301, 1965].
3. The Rader-Brenner Radix-2 DIF FFT Algorithm:

The evaluation of DFTs by conventional FFT algorithms requires complex
multiplications. We now show that a simple modification of the FFT algorithm
replaces these complex multiplications by multiplications of a complex number by
either a pure real or a pure imaginary number. Consider an N-point DFT, with N a
power of two,

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \tag{3}$$

where $W_N = e^{-j2\pi/N}$, computed in decimation-in-frequency radix-2 FFT form.
For even outputs, k is replaced by 2k:

$$X[2k] = \sum_{n=0}^{N/2-1} \{x[n] + x[n + N/2]\}\, W_N^{2nk} \tag{4}$$

and for odd outputs, k is replaced by 2k+1:

$$X[2k+1] = \sum_{n=0}^{N/2-1} \{x[n] - x[n + N/2]\}\, W_N^{n}\, W_N^{2nk} \tag{5}$$

where k = 0, 1, ..., N/2-1 in both cases.
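Equations (4) and (5) can be checked numerically. The sketch below is an illustrative floating-point model, not the VHDL design: `dif_fft` (a hypothetical name) applies the split of Equations (4) and (5) recursively, which gives a plain radix-2 DIF FFT that still uses complex twiddle factors, and it is compared against a direct evaluation of Equation (3).

```python
import cmath


def dft(x):
    """Direct DFT, Eq. (3): X[k] = sum_n x[n] W_N^{nk}, with W_N = exp(-j*2*pi/N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def twiddle(e: int, n_pts: int) -> complex:
    """W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / n_pts)


def dif_fft(x):
    """Recursive radix-2 DIF FFT built from Eqs. (4) and (5)."""
    n_pts = len(x)
    if n_pts == 0 or n_pts & (n_pts - 1):
        raise ValueError("length must be a power of two")
    if n_pts == 1:
        return [complex(x[0])]
    half = n_pts // 2
    # Eq. (4): X[2k] is the N/2-point DFT of x[n] + x[n + N/2].
    g = [x[n] + x[n + half] for n in range(half)]
    # Eq. (5): X[2k+1] is the N/2-point DFT of (x[n] - x[n + N/2]) * W_N^n.
    h = [(x[n] - x[n + half]) * twiddle(n, n_pts) for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = dif_fft(g)   # even-indexed outputs
    out[1::2] = dif_fft(h)   # odd-indexed outputs
    return out


x = [-10, 10, 20, -5, -30, -10, 20, 0]
print(max(abs(a - b) for a, b in zip(dif_fft(x), dft(x))))  # rounding-level difference
```

Each level of recursion costs N/2 complex twiddle multiplications; the Rader-Brenner modification below removes exactly these.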

Thus, the first stage of the decimation-in-frequency FFT decomposition replaces one DFT
of length N by two DFTs of length N/2 at the cost of N complex additions and N/2
complex multiplications. In order to simplify the calculation of the DFT X[2k+1],
we define the (N/2)-point auxiliary sequence $a_n$ by
$$a_n = \frac{x[n] - x[n + N/2]}{2\cos(2\pi n/N)}, \qquad n \neq 0, N/4, \tag{6}$$

with $a_0 = 0$ and $a_{N/4} = 0$.

We then compute the (N/2)-point DFT A[k] of $a_n$:

$$A[k] = \sum_{n=0}^{N/2-1} a_n\, W_N^{2nk}, \qquad k = 0, 1, \ldots, N/2-1. \tag{7}$$

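X[2k+1] can be recovered from A[k] by noting that

$$A[k] + A[k+1] = \sum_{n=0}^{N/2-1} a_n\,(1 + W_N^{2n})\, W_N^{2nk} \tag{8}$$

or, since $1 + W_N^{2n} = W_N^{n}(W_N^{-n} + W_N^{n}) = 2\cos(2\pi n/N)\, W_N^{n}$,

$$A[k] + A[k+1] = \sum_{n=0}^{N/2-1} [2 a_n \cos(2\pi n/N)]\, W_N^{n}\, W_N^{2nk} \tag{9}$$

where the index k+1 is taken modulo N/2. Compared with Equation (5), only the terms
for n = 0 and n = N/4 are missing, and since $W_N^{N/4} = -j$ (so the n = N/4 term
carries the factor $-j(-1)^k$),

$$X[2k+1] = A[k] + A[k+1] + V_0, \qquad k \text{ even} \tag{10}$$

$$X[2k+1] = A[k] + A[k+1] + V_1, \qquad k \text{ odd} \tag{11}$$

with

$$V_0 = (x[0] - x[N/2]) - j\,(x[N/4] - x[3N/4]) \tag{12}$$

$$V_1 = (x[0] - x[N/2]) + j\,(x[N/4] - x[3N/4]) \tag{13}$$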

Under these conditions, the N/2 complex multiplications by the twiddle factors $W_N^n$ in
the first stage are replaced by (N/2)-2 multiplications by the pure real
numbers $Q_n = 1/[2\cos(2\pi n/N)]$. Note that the contributions of
(x[0] - x[N/2]) and (x[N/4] - x[3N/4]) must be treated separately, because
cos(2πn/N) = 0 for n = N/4.
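The following floating-point sketch of Equations (4) and (6)-(13) is written only to illustrate the algebra; it is not the author's fixed-point VHDL. Each call splits the transform into X[2k] and A[k] and multiplies the differences by the real constants $Q_n$ only. Recursing on both half-size transforms and switching to a direct DFT once the size reaches 4 are assumptions that mirror the recursive application described next in the text; `rb_fft` is a hypothetical name.

```python
import cmath
import math


def dft(x):
    """Direct DFT, Eq. (3), used for the final 4- and 2-point transforms."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def rb_fft(x):
    """Rader-Brenner radix-2 DIF FFT sketch following Eqs. (4) and (6)-(13)."""
    n_pts = len(x)
    if n_pts == 0 or n_pts & (n_pts - 1):
        raise ValueError("length must be a power of two")
    if n_pts <= 4:
        return dft(x)  # only trivial factors (+-1, +-j) are needed at this size
    half, quarter = n_pts // 2, n_pts // 4
    g = [x[n] + x[n + half] for n in range(half)]   # Eq. (4) input, gives X[2k]
    d = [x[n] - x[n + half] for n in range(half)]   # x[n] - x[n + N/2]
    a = [0j] * half                                  # a_0 = a_{N/4} = 0
    for n in range(1, half):
        if n != quarter:
            q_n = 1.0 / (2.0 * math.cos(2.0 * math.pi * n / n_pts))  # real Q_n
            a[n] = d[n] * q_n                        # Eq. (6): complex times real
    even = rb_fft(g)                                 # X[2k]
    big_a = rb_fft(a)                                # A[k], Eq. (7)
    v0 = d[0] - 1j * d[quarter]                      # Eq. (12)
    v1 = d[0] + 1j * d[quarter]                      # Eq. (13)
    odd = [big_a[k] + big_a[(k + 1) % half] + (v0 if k % 2 == 0 else v1)
           for k in range(half)]                     # Eqs. (10) and (11)
    out = [0j] * n_pts
    out[0::2] = even
    out[1::2] = odd
    return out


x = [2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j, 6 - 7j, 7 - 8j, 8 - 9j, 9 - 10j]
print(max(abs(p - q) for p, q in zip(rb_fft(x), dft(x))))  # rounding-level difference
```

A known weakness of the method is visible in the constants: $Q_n$ grows large as n approaches N/4 (for N = 64, $Q_{15} \approx 5.1$), which matters when choosing fixed-point word lengths.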

The same method is used recursively to compute the (N/2)-point transforms
X[2k] and A[k], and then transforms of dimensions N/4, N/8, ..., until the
decomposition is complete. Since the multiplication of a complex number by a real
scalar is implemented with two real multiplications, each stage is computed with N-4
nontrivial real multiplications. We also need N complex additions for evaluating
x[n] + x[n + N/2] and x[n] - x[n + N/2], plus N+2 complex additions for calculating
Equations (10)-(13). However, two complex additions are saved in the computation of
A[k], because $a_0 = 0$ and $a_{N/4} = 0$, leaving 2N complex (4N real) additions.
Thus, for each stage, the number of real multiplications M and real additions A is

M = N - 4

A = 4N

The last two stages of the decomposition correspond to transforms of
dimensions 4 and 2, which are computed by the conventional FFT method with
trivial multiplications by ±1 and ±j.
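Applying the per-stage count M = N - 4 recursively gives a rough total. The sketch below assumes that every split of an M-point transform costs M - 4 real multiplications and that the 4- and 2-point transforms are multiplication-free; it counts multiplications only, since the addition total also depends on how the final small transforms are built. The function name is hypothetical.

```python
def rb_real_mults(n_pts: int) -> int:
    """Nontrivial real multiplications of an N-point Rader-Brenner FFT.

    Simplifying assumption: each split of an M-point transform costs M - 4
    real multiplications, and 4- and 2-point transforms cost none.
    """
    if n_pts <= 4:
        return 0
    # One split of size N, then two half-size transforms: X[2k] and A[k].
    return (n_pts - 4) + 2 * rb_real_mults(n_pts // 2)


for n in (8, 16, 32, 64):
    print(n, rb_real_mults(n))
```

For N = 8 the sketch gives 4 real multiplications, that is, the two multiplications of complex data by the real constants $Q_1$ and $Q_3$ discussed in Section 4.2.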

[6: Henri J. Nussbaumer, "Fast Fourier Transform and Convolution Algorithms",
Springer-Verlag, Berlin, Heidelberg, New York, 1980]

[7: C. M. Rader and N. M. Brenner, "A New Principle for Fast Fourier Transform",
IEEE Trans. ASSP-24, 264-265, 1976].

4. Design of the Rader-Brenner Radix-2 DIF FFT Algorithm:

The design is presented top-down: it is first described as a single unit, which is
then divided into smaller units, each of which is discussed separately. The designed
components were described in VHDL and with the schematic editor of Xilinx
Foundation (F3.1i) before being implemented in FPGA technology.

4.1 Top Level Design: 

The FFT can be represented as a processing element (the FPGA chip) that accepts
data from a source (A/D) and writes its output data to a destination (RAM), as
shown in Figure (1).

Data are fed into the FPGA chip with a clock supplied by the data source
itself, which makes the design applicable to real-time applications. The output
data are generated at the same frequency as the input data.
Figure (1): Block diagram of the system with FPGA chip implementation.
The FPGA chip contains the main elements of the top-level design, which is
described by Figure (2).

As shown in Figure (2), the main units of the FFT are:

a. The input unit (serial in parallel out). 

b. The FFT unit 

c. The output unit (parallel in serial out). 
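The data flow through these three units can be modeled behaviorally. In the sketch below, `sipo`, `piso` and `stream_fft` are hypothetical names, and a direct DFT stands in for the FFT unit; the point is only that one sample enters and one bin leaves per clock, so input and output run at the same rate.

```python
import cmath
from typing import Iterable, Iterator, List


def dft(frame: List[complex]) -> List[complex]:
    """Stand-in for the FFT unit: a direct DFT of one N-sample frame."""
    n_pts = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def sipo(samples: Iterable[complex], n_pts: int) -> Iterator[List[complex]]:
    """Input unit (serial in, parallel out): one sample per clock, one frame per N clocks."""
    frame: List[complex] = []
    for sample in samples:
        frame.append(sample)
        if len(frame) == n_pts:
            yield frame
            frame = []


def piso(frames: Iterable[List[complex]]) -> Iterator[complex]:
    """Output unit (parallel in, serial out): one FFT bin per clock."""
    for frame in frames:
        yield from frame


def stream_fft(samples: Iterable[complex], n_pts: int) -> Iterator[complex]:
    """Input unit -> FFT unit -> output unit, same rate in and out."""
    return piso(dft(frame) for frame in sipo(samples, n_pts))


x = [-10, 10, 20, -5, -30, -10, 20, 0]
print(len(list(stream_fft(x + x, 8))))  # two frames in, 16 bins out
```

At the 15 MHz clock quoted in the abstract, a new 64-sample frame would be collected every 64/15 ≈ 4.27 µs.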
Figure (2): Block diagram of top-level design implementation.



4.2 Rader-Brenner Radix-2 DIF FFT: 

The 8-point Rader-Brenner Radix-2 DIF FFT algorithm is shown in Figure (3).

Figure (3): 8-point Rader-Brenner DIF FFT algorithm.

In this method, the four multiplications by the twiddle factors $W_8^n$ of the radix-2 FFT
are reduced to two real multiplications in the first stage, and the differences
(x[0] - x[4]) and (x[2] - x[6]) are saved to be added in the last stage of the FFT.
In the last two stages, only the trivial multiplications by -1, +1, -j and +j remain,
as in the radix-2 DIF FFT (see Figure (3)).

The butterflies above are represented in Figure (4) by an adder and a
subtractor without a multiplier, because the subtractor outputs of the first and third
butterflies are saved for the third stage of the FFT.




Figure (4): Representation of the Rader-Brenner Radix-2 DIF FFT butterflies.
The expressions describing x* and y* are:

$$x^* = x + y$$

$$y^* = x - y$$

The inputs to the butterfly have to be described with real and imaginary parts. This
gives:

$$x = x_{Re} + j\,x_{Im}, \qquad y = y_{Re} + j\,y_{Im}$$

The output from the butterfly also consists of real and imaginary parts, and x* can
now be expressed as:

$$x^*_{Re} = x_{Re} + y_{Re}, \qquad x^*_{Im} = x_{Im} + y_{Im}$$

and the output y* is expressed as:

$$y^*_{Re} = x_{Re} - y_{Re}, \qquad y^*_{Im} = x_{Im} - y_{Im}$$

Using these expressions for x* and y*, a butterfly component is built as shown in
Figure (5).
Figure (5): Butterfly built from an adder and a subtractor in the block set.
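A bit-level sketch of the adder/subtractor butterfly of Figure (5) follows. The 16-bit word length and the two's-complement wrap-around on overflow are assumptions for illustration; the text does not state how the hardware handles overflow, and the function names are hypothetical.

```python
WIDTH = 16  # hypothetical word length; the text does not specify it


def wrap(value: int, width: int = WIDTH) -> int:
    """Reduce an integer to a signed two's-complement word of `width` bits."""
    value &= (1 << width) - 1                 # keep the low `width` bits
    if value >> (width - 1):                  # sign bit set -> negative value
        value -= 1 << width
    return value


def butterfly(x_re: int, x_im: int, y_re: int, y_im: int, width: int = WIDTH):
    """Adder/subtractor butterfly of Figure (5): x* = x + y, y* = x - y."""
    xs = (wrap(x_re + y_re, width), wrap(x_im + y_im, width))   # adder pair
    ys = (wrap(x_re - y_re, width), wrap(x_im - y_im, width))   # subtractor pair
    return xs, ys


print(butterfly(2, 3, 6, -7))              # ((8, -4), (-4, 10))
print(butterfly(100, 0, 100, 0, width=8))  # ((-56, 0), (0, 0)): 200 overflows 8 bits
```

Real and imaginary parts never interact in this butterfly, which is why it needs only two adders and two subtractors.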
Under these conditions, the four (N/2) complex multiplications by the twiddle
factors $W_8^n$ in the first stage are replaced with two ((N/2)-2) multiplications by
the pure real numbers $Q_n = 1/[2\cos(\pi n/4)]$.

The multiplication of a real data input by $Q_n$ is implemented with one real
multiplication, which forms

$$Z = x \cdot Q_n \quad \text{or} \quad Z = y \cdot Q_n$$

Using this expression for Z, a multiplier is built in the block set as shown in
Figure (6).
Figure (6): Building a real multiplier component in the block set.
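The real multiplier of Figure (6) can be modeled in fixed point. In the sketch below, the number of fractional bits used to store $Q_n$ (8) and the rounding and truncation choices are assumptions, not values taken from the design. For N = 8, $Q_1 = 1/(2\cos(\pi/4)) \approx 0.7071$ and $Q_3 \approx -0.7071$.

```python
import math

FRAC_BITS = 8  # hypothetical fractional bits for the stored Q_n constants


def q_const(n: int, n_pts: int = 8, frac_bits: int = FRAC_BITS) -> int:
    """Q_n = 1 / (2 cos(2*pi*n/N)), rounded to a fixed-point integer."""
    c = math.cos(2.0 * math.pi * n / n_pts)
    if abs(c) < 1e-12:
        raise ValueError("Q_n is undefined for n = N/4")
    return round((1.0 / (2.0 * c)) * (1 << frac_bits))


def real_multiply(sample: int, q_fixed: int, frac_bits: int = FRAC_BITS) -> int:
    """Z = sample * Q_n: one integer multiply, then an arithmetic right shift
    (which truncates toward minus infinity) to drop the fractional bits."""
    return (sample * q_fixed) >> frac_bits


q1, q3 = q_const(1), q_const(3)
print(q1, q3)                                           # 181 -181
print(real_multiply(100, q1), real_multiply(-100, q1))  # 70 -71
```

A complex data word would be handled by applying `real_multiply` to its real and imaginary parts separately, which is the two real multiplications per scalar product counted in Section 3.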
It is very important to have good control over the input and output ports of the
butterfly, because they are connected to other units.



The chosen structure is the Rader-Brenner DIF structure; this choice determines how
the butterflies are connected to form the FFT. Another consideration is the order of
the input and output data. In this case, bit-reversed input order and normal output
order are chosen, to simplify the handling of the output data and to keep control
over the implementation.
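Bit-reversed ordering means that position i receives the sample whose index is i with its binary digits reversed. A minimal sketch, with hypothetical function names:

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the current LSB
        index >>= 1
    return result


def bit_reversed_order(n_pts: int) -> list:
    """Index permutation: position i is fed with sample bit_reverse(i)."""
    bits = n_pts.bit_length() - 1
    if n_pts < 1 or (1 << bits) != n_pts:
        raise ValueError("N must be a power of two")
    return [bit_reverse(i, bits) for i in range(n_pts)]


order = bit_reversed_order(8)
x = [-10, 10, 20, -5, -30, -10, 20, 0]
print(order)                   # [0, 4, 2, 6, 1, 5, 3, 7]
print([x[i] for i in order])   # samples in bit-reversed order
```

Because bit reversal is its own inverse, the same table can be used to reorder data in either direction.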

D flip-flops are placed between every two stages to control the signal rate and to
pipeline the system for higher throughput; the same registers serve as the output of
one butterfly and the input of the next.

With these requirements, an 8-point Rader-Brenner DIF FFT system is designed as
shown in the schematic of Figure (7), and the VHDL program for the 64-point
Rader-Brenner DIF FFT is given in Appendix A.




Figure (7): Schematic diagram of the 8-point Rader-Brenner Radix-2 DIF FFT stage design.

Finally, the internal structure and timing diagram of the Radix-2 Rader-Brenner
DIF FFT are shown in Figure (8).
Figure (8): Radix-2 Rader-Brenner DIF FFT.



5. System Implementation, Verification and Results: 

The implementation phase consists of choosing a target device and the
software used to implement the design described in Section 4. After that, all the
components are entered in the schematic editor and tested with the software's
simulator, and the results are compared with those of standard software such as MATLAB.

All devices in the XC-4000 family have the same specifications except for their
gate count. Therefore, the target device is chosen after determining the
number of gates (or CLBs) needed for the design.

The selected device is the XC-4085, which contains 85,000 gates (or 3136 CLBs)
on a single chip [31]. It was selected for two reasons: first, the design needs
roughly 1800 to 2000 CLBs (as will be shown in Section 5.4); second, its
availability and price.

Xilinx Inc., the vendor of the target device, also develops various kinds of
software dedicated to FPGA devices. Its well-known Xilinx Foundation Series can
program any FPGA device produced by Xilinx up to the date of the released version.
Xilinx also offers a more professional tool, Xilinx ISE (Integrated System
Environment). The selected software is the Xilinx Foundation Series (F3.1i), and its
simulator is used to simulate the schematic design.



[8: Xilinx, "The Programmable Logic Data Book 2000", San Jose, CA, 2000]

[9: Zainalabedin Navabi, "VHDL: Analysis and Modeling of Digital Systems", 
Prentice Hall Inc., 1993] 

5.1 The Design Implementation: 

Using the Foundation Series (F3.1i) with its schematic editor and VHDL, all the
components described in Section 4 were implemented in a single-chip FPGA, as
shown in Figure (9), in the following steps:




Figure (9): Process of implementation into a single-chip FPGA.
In the digital design stage, the design is created with the schematic and VHDL
design editors. The schematic entry program uses graphic symbols to represent the
circuitry.

The final schematic design is shown in Figure (5.2). This program produces
netlists as its output. The library sets of the target FPGA must be available
in the selected tool.

To check the designed system, clock and data inputs representing samples of the
main waveform from the A/D device are applied, as listed in Tables (1) and (2) for
the 8-point FFT; Figure (10) shows the result for the 64-point FFT. The clock enable
is set active high and the reset low. All these values can be generated in
hexadecimal form with the insert-waveform tool of the logic simulator.
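The hexadecimal values in the tables that follow are consistent with 8-bit two's-complement encoding (for example, -10 is F6 and F9 is -7). The sketch below assumes that encoding and converts between decimal and hex, as one would when preparing waveform values for the simulator; the function names are hypothetical.

```python
def to_hex8(value: int) -> str:
    """Encode a signed integer as an 8-bit two's-complement hex byte."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in 8 signed bits")
    return f"{value & 0xFF:02X}"


def from_hex8(text: str) -> int:
    """Decode an 8-bit two's-complement hex byte back to a signed integer."""
    raw = int(text, 16)
    return raw - 256 if raw >= 128 else raw


samples = [-10, 10, 20, -5, -30, -10, 20, 0]
print([to_hex8(s) for s in samples])  # ['F6', '0A', '14', 'FB', 'E2', 'F6', '14', '00']
```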



Table (1): The real data input of the system

Point   Input (decimal)   Input (hex)   Output (decimal)
0       -10               F6            -5
1       10                0A            37.67 - 10.60j
2       20                14            -80 - 5j
3       -5                FB            2.32 - 10.60j
4       -30               E2            5
5       -10               F6            2.32 + 10.60j
6       20                14            -80 + 5j
7       0                 00            37.67 + 10.60j


Table (2): The complex (real and imaginary) data input of the system

Point   Input (decimal)   Input (hex)   Output (decimal)
0       2 + 3j            02 + 03j      44 - 16j
1       3 + 4j            03 + 04j      29.79 + 16.82j
2       4 + 5j            04 + 05j      -4 + 4j
3       5 + 6j            05 + 06j      1.79 + 14.48j
4       6 - 7j            06 - F9j      -4
5       7 - 8j            07 - F8j      -9.79 + 11.17j
6       8 - 9j            08 - F7j      -4 - 4j
7       9 - 10j           09 - F6j      -37.79 - 2.48j
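As a reference check in the spirit of the MATLAB comparison described in this section, the tabulated outputs can be recomputed from Equation (3) in double precision. The sketch below transcribes Tables (1) and (2); run as written, both maximum deviations come out below 0.02, consistent with the tabulated outputs being truncated to two decimals.

```python
import cmath


def dft(x):
    """Direct DFT, Eq. (3), in double precision (a MATLAB-style reference)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


# Inputs and simulated outputs transcribed from Tables (1) and (2).
TABLE1_IN = [-10, 10, 20, -5, -30, -10, 20, 0]
TABLE1_OUT = [-5, 37.67 - 10.60j, -80 - 5j, 2.32 - 10.60j,
              5, 2.32 + 10.60j, -80 + 5j, 37.67 + 10.60j]
TABLE2_IN = [2 + 3j, 3 + 4j, 4 + 5j, 5 + 6j, 6 - 7j, 7 - 8j, 8 - 9j, 9 - 10j]
TABLE2_OUT = [44 - 16j, 29.79 + 16.82j, -4 + 4j, 1.79 + 14.48j,
              -4, -9.79 + 11.17j, -4 - 4j, -37.79 - 2.48j]


def max_error(inputs, expected):
    """Largest |reference - table| over all bins."""
    return max(abs(ref - tab) for ref, tab in zip(dft(inputs), expected))


print(f"Table (1): max error {max_error(TABLE1_IN, TABLE1_OUT):.3f}")
print(f"Table (2): max error {max_error(TABLE2_IN, TABLE2_OUT):.3f}")
```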




Figure (10): Power spectrum from the 64-point FFT of two sine waves with different frequencies (horizontal axis: frequency in MHz).
The results in hexadecimal form are taken from the simulator; they represent the
data output of the system and also give the data output of every stage (top to
bottom) of the system at each clock.

If the simulation result agrees with the MATLAB or FPGA IEEE results (its value
being multiplied by a factor of 1000, as shown in Tables (1) and (2)), the flow
proceeds to the next step (Design Implementation-Mapping); if the results do not
agree, we must return to the first step (Design and Implementation) to correct the
design.

In the Design Implementation stage, the netlist file produced by the design
entry program is converted into the bit-stream file that configures the FPGA, and
the software produces a translation report (Appendix B).

The first step maps the design onto the FPGA resources; the map report gives all
the details of how the design is mapped to fit inside the chip and therefore contains
important data. The map report also lists the output signals and their assigned
pins on the chip.

The second step places the logic blocks created in the mapping process at
specific locations in the FPGA. The third step routes the interconnect paths
between the logic blocks. The place-and-route report (Appendix B) is then
delivered; it contains information about the placement of components in the chip
and data about the timing delays.

A pad report (Appendix B) is also generated to show the function of each pin of
the chip. All the timing delay constraints are listed in the Asynchronous Delay
report (Appendix B), which contains all the net delays inside the chip.
The last report, the timing report (Appendix B), also gives some important timing
figures, such as the "Maximum Frequency".

Finally, the output of design implementation and mapping is a Logic Cell Array
(LCA) file for the particular FPGA. This LCA file is then converted into a bit-stream
file for configuring the FPGA.

The Design Verification Step tests the design's logic and timing using input 
stimuli. Various CAD software packages provide verification/simulation tools. These 
tools are designed to perform detailed characterization of the design, by performing 
both functional and timing simulations. In-circuit verification is another way to test 
the design. In-circuit verification tests the circuit under typical operating conditions. 
The Virtual Computer reconfigurable computer can be used as an in-circuit
verification system. The output of this step is a simulation status: if the
simulation fails, we must go back to design entry; if it passes, we proceed to
FPGA configuration. Figure (11) shows the propagation delay in the system design.

Configuration is the process in which the circuit design (bit-stream file) is
downloaded into the FPGA. The method of configuring the FPGA determines the type
of bit-stream file. FPGAs can be configured from a PROM, most commonly a serial
PROM. The FPGA can either actively read its configuration data from an external
serial or byte-parallel PROM (master mode), or the configuration data can be
written into the FPGA (slave and peripheral modes). Where the FPGA is used in a
reconfigurable computing platform, the bit-stream file is converted into a
high-level language (e.g., C) function, so that the FPGA is configured from
within an application program.

To verify the complete design, sinusoidal signals with a DC level and different
frequencies were applied to the chip. All input signals are sampled at the
Nyquist rate with 8-bit resolution. The result is shown in Figure (11).
Figure (11): System result and operation in real time.
6. Conclusions and Suggestions for Future Work: 



A single-chip Rader-Brenner Radix-2 DIF FFT is designed and implemented in this work.

The designed system was tested using simulation software on a PC. From the
implementation reports and simulation tests, the following conclusions have been
drawn:



a. The system operates successfully in real-time applications at frequencies
up to 30 MHz. The Rader-Brenner Radix-2 DIF FFT method reaches the highest
frequency because the data pass through fewer operations than in the other
methods.

b. The Rader-Brenner Radix-2 DIF FFT makes the best use of chip capacity because
it needs fewer multiplications than the other methods.

c. The system works in real time: as the input samples from the A/D are fed in,
the FFT results are ready at the end of the Kth sample, i.e., at the Kth clock
pulse, at which moment the FFT unit produces its output data.



d. In order to plot the FFT magnitude against frequency in Hz, a frequency vector
must be created. Because the Nyquist frequency (fs/2, where fs is the sampling
frequency) corresponds to point N/2+1 of an even-length N-point sequence, the
frequency vector has N/2+1 elements, evenly spaced between 0 and fs/2 Hz.
e. The bit-stream file, which contains the netlist design, can be downloaded into an
EPROM connected to the XC-4085 device; the design is then loaded into the
device after power-up.
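The frequency-vector rule in item d can be sketched as follows. The sampling rate (64 MHz), the two tone frequencies (5 MHz and 12 MHz) and the DC level are hypothetical values chosen only so that each tone falls exactly on an FFT bin; they are not the signals used for Figure (10). A direct DFT stands in for the FFT unit.

```python
import cmath
import math


def freq_vector(n_pts: int, fs: float) -> list:
    """N/2 + 1 bin frequencies, evenly spaced from 0 to fs/2 (N even)."""
    if n_pts % 2:
        raise ValueError("N must be even")
    return [k * fs / n_pts for k in range(n_pts // 2 + 1)]


def power_spectrum(x) -> list:
    """|X[k]|^2 for k = 0..N/2, using a direct DFT in place of the FFT unit."""
    n_pts = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts)
                    for n in range(n_pts))) ** 2
            for k in range(n_pts // 2 + 1)]


N, FS = 64, 64e6  # hypothetical: 64 points at 64 MHz gives 1 MHz bin spacing
signal = [1.0 + math.sin(2 * math.pi * 5e6 * n / FS)
          + 0.5 * math.sin(2 * math.pi * 12e6 * n / FS) for n in range(N)]
freqs, power = freq_vector(N, FS), power_spectrum(signal)
peaks = sorted(sorted(range(len(power)), key=power.__getitem__)[-3:])
print([freqs[k] / 1e6 for k in peaks])  # DC and the two tones, in MHz
```

The last element of the frequency vector is fs/2 = 32 MHz, the Nyquist frequency, which is why only N/2+1 of the N FFT outputs are plotted for a real input.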

The work can be extended by designing and implementing the A/D in the FPGA chip.